Conversation

iamaleksey (Member)

patch by Aleksey Yeschenko; reviewed by Abe Ratnofsky for CASSANDRA-20882

iamaleksey requested a review from aratno on Sep 3, 2025 16:02
iamaleksey self-assigned this on Sep 3, 2025

localWitnessed = Offsets.Mutable.copy(witnessedOffsets.get(localNodeId));

witnessedOffsets.convertToPrimitiveMap(witnessed);
persistedOffsets.convertToPrimitiveMap(persisted);

aratno (Contributor)

Would be good to make this incremental in the future, since every witnessed offset will eventually be persisted and reconciled

iamaleksey (Member, Author)

I'm not sure about this one. We must overwrite the entire value of each list in the system table because of the representation, but that shouldn't be a big deal: the size of each list should stay roughly the same, i.e. the head, with no gaps, collapsed into one range, plus the slightly sparse tail. I don't see how to make it incremental in this context, or what it'd bring. I could easily be missing something though.
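
For illustration, roughly the shape of the representation described above. This is a simplified sketch with made-up names, not the actual Offsets.Mutable internals:

import java.util.TreeSet;

// Simplified sketch: the contiguous head [0..headEnd] collapses into a single
// range, and only the sparse tail past it is stored explicitly.
final class OffsetsSketch
{
    private long headEnd = -1;                          // every offset <= headEnd is present
    private final TreeSet<Long> tail = new TreeSet<>(); // sparse offsets past the head

    void add(long offset)
    {
        if (offset <= headEnd)
            return;                                     // already covered by the head range
        tail.add(offset);
        while (!tail.isEmpty() && tail.first() == headEnd + 1)
            headEnd = tail.pollFirst();                 // absorb offsets that now extend the head
    }

    boolean contains(long offset)
    {
        return offset <= headEnd || tail.contains(offset);
    }
}

This is also why the overwrite stays cheap: however many offsets have been witnessed, the serialized form is one head range plus a short tail.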

aratno (Contributor)

Right, I wasn't thinking about the frozen representation at the time. Makes sense as-is for now.

persistedOffsets could grow for quite a while if a single replica is down and can't reconcile, so that's where I see this getting expensive in the future.

aratno (Contributor) left a comment

LGTM. Would be good to have a test extending org.apache.cassandra.fuzz.topology.TopologyMixupTestBase to cover bounces as well.

Broadcasting persisted offsets every minute feels infrequent to me. The more frequently we broadcast persisted offsets, the sooner we can mark SSTables as repaired and purge CoordinatorLogOffsets. We want to avoid doing too many compactions on an SSTable before we can mark it repaired, so maybe we should determine persistence and broadcast timing based on the number of mutations that have been marked reconciled, broadcasting more frequently when reconciliation is happening quickly.
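
Something along these lines, perhaps. Purely illustrative: the threshold, interval, and broadcast hook are made-up names, not existing APIs:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of count-triggered broadcasting: fire early once enough
// mutations have been reconciled, keeping the periodic timer as a fallback.
final class BroadcastTrigger
{
    private static final long RECONCILED_THRESHOLD = 10_000; // assumed tuning knob
    private static final long MAX_INTERVAL_MILLIS = 60_000;  // the current once-a-minute cadence

    private final AtomicLong reconciledSinceBroadcast = new AtomicLong();
    private volatile long lastBroadcastMillis = System.currentTimeMillis();

    void onMutationReconciled()
    {
        if (reconciledSinceBroadcast.incrementAndGet() >= RECONCILED_THRESHOLD)
            broadcast();                                  // reconciliation is fast, broadcast sooner
    }

    void onTimerTick()
    {
        if (System.currentTimeMillis() - lastBroadcastMillis >= MAX_INTERVAL_MILLIS)
            broadcast();                                  // fallback: at least once per interval
    }

    private synchronized void broadcast()
    {
        reconciledSinceBroadcast.set(0);
        lastBroadcastMillis = System.currentTimeMillis();
        // persistAndBroadcastOffsets();                  // hypothetical hook into the existing path
    }
}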

for (int offset = iter.start(), end = iter.end(); offset <= end; offset++)
{
    ShortMutationId id = new ShortMutationId(witnessed.logId, offset);
    result.addForTesting(MutationJournal.instance.read(id));
}

aratno (Contributor)

Can we avoid calling a *ForTesting method from a production path?

iamaleksey (Member, Author)

That must've been a botched refactoring rename, clearly unintended. Thanks for catching it.

iamaleksey (Member, Author) commented on Sep 19, 2025

> persistedOffsets could grow for quite a while if a single replica is down and can't reconcile, so that's where I see this getting expensive in the future.

I must be missing something, but why would it grow? We aren't recording what's reconciled - we are recording what's been witnessed and written to the system table.

aratno (Contributor) commented on Sep 19, 2025

Discussed out loud - I had some concerns about our offsets representation causing trouble if we end up with sparse representations, but that's a niche case that will impact more than just the durability paths. We should measure the logical vs. physical size of offsets, or "dropped" mutation IDs, to see whether that's a problem for any real workloads. Merge away.
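
For concreteness, the measurement could look something like this. An illustrative sketch assuming a sorted, range-based encoding; not an existing API:

// Ratio of ranges physically stored to offsets logically covered: stays near
// zero while the head collapses well, grows as gaps ("dropped" mutation IDs)
// accumulate and the representation turns sparse.
static double sparseness(long[][] ranges) // each element is {start, end}, inclusive
{
    long logical = 0; // offsets logically covered
    for (long[] r : ranges)
        logical += r[1] - r[0] + 1;
    return logical == 0 ? 0.0 : (double) ranges.length / logical;
}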

iamaleksey (Member, Author)

Committed as bd5a657, cheers.

iamaleksey closed this on Sep 19, 2025